In this study, we examine air accidents and predict whether an accident results in fatal or serious injury. Two questions guide the analysis:
What factors affect the occurrence of air accidents?
To what extent can machine learning algorithms predict the severity of air accidents?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv
data = pd.read_excel(r'E:\IUST\ترم 8\قابلیت اطمینان انسانی\Dataset for HRA.xlsx')  # raw string avoids backslash-escape surprises in the Windows path
data.head()
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | 91= namely general aviation pilots | 121 = qualified pilots | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20060629X00856 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 | NaN | NaN |
| 1 | 20060113X00068 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 4 | 1 | NaN | NaN |
| 2 | 20060929X01431 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 | NaN | NaN |
| 3 | 20070109X00026 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | NaN | NaN |
| 4 | 20060719X00965 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 | NaN | NaN |
5 rows × 28 columns
data.tail()
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | 91= namely general aviation pilots | 121 = qualified pilots | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 474 | 20150717X41748 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | NaN | NaN |
| 475 | 20151022X65901 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | NaN | NaN |
| 476 | 20151130X61148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 5 | 0 | NaN | NaN |
| 477 | 20151213X84149 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 5 | 0 | NaN | NaN |
| 478 | 20150605X52542 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 6 | 0 | NaN | NaN |
5 rows × 28 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 28 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   EventId                             479 non-null    object
 1   Performance-Based Errors            479 non-null    int64
 2   Judgment & Decision-Making Errors   479 non-null    int64
 3   Violations                          479 non-null    int64
 4   Teamwork                            479 non-null    int64
 5   Mental Awareness                    479 non-null    int64
 6   State of Mind                       479 non-null    int64
 7   Physical Problems                   479 non-null    int64
 8   Sensory Misperception               479 non-null    int64
 9   Physical Environment                479 non-null    int64
 10  Technological Environment           479 non-null    int64
 11  Inadequate Supervision              479 non-null    int64
 12  Planned Inappropriate Operations    479 non-null    int64
 13  Supervisory Violations              479 non-null    int64
 14  Resource Problems                   479 non-null    int64
 15  Personnel Selection & Staffing      479 non-null    int64
 16  Climate/ Culture Influences         479 non-null    int64
 17  Policy & Process Issues             479 non-null    int64
 18  Technology Failure                  479 non-null    int64
 19  Acts                                479 non-null    int64
 20  Preconditions                       479 non-null    int64
 21  Supervision                         479 non-null    int64
 22  Organization                        479 non-null    int64
 23  Fatal or Serious                    479 non-null    int64
 24  Flight Segment 1=Taxi               479 non-null    int64
 25  91=1/121=0                          479 non-null    int64
 26  91= namely general aviation pilots  0 non-null      float64
 27  121 = qualified pilots              0 non-null      float64
dtypes: float64(2), int64(25), object(1)
memory usage: 104.9+ KB
data.describe()
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | Technological Environment | ... | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | 91= namely general aviation pilots | 121 = qualified pilots | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | ... | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 479.000000 | 0.0 | 0.0 |
| mean | 0.480167 | 0.127349 | 0.070981 | 0.035491 | 0.016701 | 0.004175 | 0.039666 | 0.018789 | 0.162839 | 0.002088 | ... | 0.319415 | 0.638831 | 0.252610 | 0.045929 | 0.048017 | 0.354906 | 4.951983 | 0.678497 | NaN | NaN |
| std | 0.500129 | 0.333712 | 0.257062 | 0.185210 | 0.128284 | 0.064549 | 0.195377 | 0.135922 | 0.369605 | 0.045691 | ... | 0.466738 | 0.480842 | 0.434963 | 0.209550 | 0.214025 | 0.478985 | 2.026960 | 0.467542 | NaN | NaN |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | NaN | NaN |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | NaN | NaN |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 1.000000 | NaN | NaN |
| 75% | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 7.000000 | 1.000000 | NaN | NaN |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 7.000000 | 1.000000 | NaN | NaN |
8 rows × 27 columns
data.columns
Index(['EventId', 'Performance-Based Errors',
'Judgment & Decision-Making Errors', 'Violations', 'Teamwork',
'Mental Awareness', 'State of Mind', 'Physical Problems',
'Sensory Misperception', 'Physical Environment',
'Technological Environment', 'Inadequate Supervision',
'Planned Inappropriate Operations', 'Supervisory Violations',
'Resource Problems', 'Personnel Selection & Staffing',
'Climate/ Culture Influences', 'Policy & Process Issues',
'Technology Failure', 'Acts', 'Preconditions', 'Supervision',
'Organization', 'Fatal or Serious', 'Flight Segment 1=Taxi',
'91=1/121=0', '91= namely general aviation pilots',
'121 = qualified pilots'],
dtype='object')
data.drop(['91= namely general aviation pilots','121 = qualified pilots'],axis=1,inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 479 entries, 0 to 478
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   EventId                            479 non-null    object
 1   Performance-Based Errors           479 non-null    int64
 2   Judgment & Decision-Making Errors  479 non-null    int64
 3   Violations                         479 non-null    int64
 4   Teamwork                           479 non-null    int64
 5   Mental Awareness                   479 non-null    int64
 6   State of Mind                      479 non-null    int64
 7   Physical Problems                  479 non-null    int64
 8   Sensory Misperception              479 non-null    int64
 9   Physical Environment               479 non-null    int64
 10  Technological Environment          479 non-null    int64
 11  Inadequate Supervision             479 non-null    int64
 12  Planned Inappropriate Operations   479 non-null    int64
 13  Supervisory Violations             479 non-null    int64
 14  Resource Problems                  479 non-null    int64
 15  Personnel Selection & Staffing     479 non-null    int64
 16  Climate/ Culture Influences        479 non-null    int64
 17  Policy & Process Issues            479 non-null    int64
 18  Technology Failure                 479 non-null    int64
 19  Acts                               479 non-null    int64
 20  Preconditions                      479 non-null    int64
 21  Supervision                        479 non-null    int64
 22  Organization                       479 non-null    int64
 23  Fatal or Serious                   479 non-null    int64
 24  Flight Segment 1=Taxi              479 non-null    int64
 25  91=1/121=0                         479 non-null    int64
dtypes: int64(25), object(1)
memory usage: 97.4+ KB
data['EventId'].nunique()
477
data[data['EventId'].duplicated()]
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | 20090604X25647 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
| 303 | 20110518X94643 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
2 rows × 26 columns
data[data['EventId']=='20090604X25647']
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 156 | 20090604X25647 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
| 157 | 20090604X25647 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
2 rows × 26 columns
data.drop(157,axis=0,inplace=True)
data[data['EventId']=='20090604X25647']
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 156 | 20090604X25647 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
1 rows × 26 columns
data[data['EventId']=='20110518X94643']
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 302 | 20110518X94643 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| 303 | 20110518X94643 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 0 |
2 rows × 26 columns
data.drop(303,axis=0,inplace=True)
data[data['EventId']=='20110518X94643']
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 302 | 20110518X94643 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
1 rows × 26 columns
data['EventId'].nunique()
477
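The two duplicated EventIds were removed above by hand-picked row index; `drop_duplicates` does the same cleanup in one step. A minimal sketch on toy data (`keep='first'` mirrors retaining the earlier row; note the second duplicate pair above differed in Flight Segment, so a manual check of conflicting duplicates is still worthwhile):

```python
import pandas as pd

# Toy frame with one repeated EventId, mirroring the situation above.
toy = pd.DataFrame({
    'EventId': ['A1', 'A2', 'A2', 'A3'],
    'Fatal or Serious': [0, 1, 1, 0],
})

# Keep only the first occurrence of each EventId, as done manually above.
deduped = toy.drop_duplicates(subset='EventId', keep='first')
print(len(deduped))  # 3
```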
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 477 entries, 0 to 478
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   EventId                            477 non-null    object
 1   Performance-Based Errors           477 non-null    int64
 2   Judgment & Decision-Making Errors  477 non-null    int64
 3   Violations                         477 non-null    int64
 4   Teamwork                           477 non-null    int64
 5   Mental Awareness                   477 non-null    int64
 6   State of Mind                      477 non-null    int64
 7   Physical Problems                  477 non-null    int64
 8   Sensory Misperception              477 non-null    int64
 9   Physical Environment               477 non-null    int64
 10  Technological Environment          477 non-null    int64
 11  Inadequate Supervision             477 non-null    int64
 12  Planned Inappropriate Operations   477 non-null    int64
 13  Supervisory Violations             477 non-null    int64
 14  Resource Problems                  477 non-null    int64
 15  Personnel Selection & Staffing     477 non-null    int64
 16  Climate/ Culture Influences        477 non-null    int64
 17  Policy & Process Issues            477 non-null    int64
 18  Technology Failure                 477 non-null    int64
 19  Acts                               477 non-null    int64
 20  Preconditions                      477 non-null    int64
 21  Supervision                        477 non-null    int64
 22  Organization                       477 non-null    int64
 23  Fatal or Serious                   477 non-null    int64
 24  Flight Segment 1=Taxi              477 non-null    int64
 25  91=1/121=0                         477 non-null    int64
dtypes: int64(25), object(1)
memory usage: 100.6+ KB
plt.figure(figsize=(8,5))
sns.heatmap(data=data.isnull(),cbar=False,yticklabels=False,cmap='viridis')
<AxesSubplot:>
data.isnull().sum()
EventId                              0
Performance-Based Errors             0
Judgment & Decision-Making Errors    0
Violations                           0
Teamwork                             0
Mental Awareness                     0
State of Mind                        0
Physical Problems                    0
Sensory Misperception                0
Physical Environment                 0
Technological Environment            0
Inadequate Supervision               0
Planned Inappropriate Operations     0
Supervisory Violations               0
Resource Problems                    0
Personnel Selection & Staffing       0
Climate/ Culture Influences          0
Policy & Process Issues              0
Technology Failure                   0
Acts                                 0
Preconditions                        0
Supervision                          0
Organization                         0
Fatal or Serious                     0
Flight Segment 1=Taxi                0
91=1/121=0                           0
dtype: int64
data['Performance-Based Errors'].value_counts()
0    249
1    228
Name: Performance-Based Errors, dtype: int64
data['Judgment & Decision-Making Errors'].value_counts()
0    416
1     61
Name: Judgment & Decision-Making Errors, dtype: int64
data['Violations'].value_counts()
0    443
1     34
Name: Violations, dtype: int64
data['Teamwork'].value_counts()
0    460
1     17
Name: Teamwork, dtype: int64
data['Mental Awareness'].value_counts()
0    469
1      8
Name: Mental Awareness, dtype: int64
data['State of Mind'].value_counts()
0    475
1      2
Name: State of Mind, dtype: int64
data['Physical Problems'].value_counts()
0    458
1     19
Name: Physical Problems, dtype: int64
data['Sensory Misperception'].value_counts()
0    468
1      9
Name: Sensory Misperception, dtype: int64
data['Physical Environment'].value_counts()
0    399
1     78
Name: Physical Environment, dtype: int64
data['Technological Environment'].value_counts()
0    476
1      1
Name: Technological Environment, dtype: int64
data['Inadequate Supervision'].value_counts()
0    456
1     21
Name: Inadequate Supervision, dtype: int64
data['Planned Inappropriate Operations'].value_counts()
0    476
1      1
Name: Planned Inappropriate Operations, dtype: int64
data['Supervisory Violations'].value_counts()
0    477
Name: Supervisory Violations, dtype: int64
data['Resource Problems'].value_counts()
0    470
1      7
Name: Resource Problems, dtype: int64
data['Personnel Selection & Staffing'].value_counts()
0    477
Name: Personnel Selection & Staffing, dtype: int64
data['Climate/ Culture Influences'].value_counts()
0    477
Name: Climate/ Culture Influences, dtype: int64
data['Policy & Process Issues'].value_counts()
0    461
1     16
Name: Policy & Process Issues, dtype: int64
data['Technology Failure'].value_counts()
0    324
1    153
Name: Technology Failure, dtype: int64
data['Acts'].value_counts()
1    304
0    173
Name: Acts, dtype: int64
data['Preconditions'].value_counts()
0    356
1    121
Name: Preconditions, dtype: int64
data['Supervision'].value_counts()
0    455
1     22
Name: Supervision, dtype: int64
data['Organization'].value_counts()
0    454
1     23
Name: Organization, dtype: int64
data['Fatal or Serious'].value_counts()
0    307
1    170
Name: Fatal or Serious, dtype: int64
data['Flight Segment 1=Taxi'].value_counts()
7    170
2    115
6     81
5     43
4     35
3     33
Name: Flight Segment 1=Taxi, dtype: int64
data['91=1/121=0'].value_counts()
1    325
0    152
Name: 91=1/121=0, dtype: int64
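Rather than inspecting each `value_counts` by eye, near-constant indicator columns can be flagged programmatically. A sketch on toy binary data (the cutoff value is an illustrative choice, not one from the original analysis):

```python
import pandas as pd

# Toy binary frame standing in for the 0/1 indicator columns above.
toy = pd.DataFrame({
    'Common':  [1, 0, 1, 1, 0, 1],
    'Rare':    [0, 0, 0, 0, 0, 1],
    'AllZero': [0, 0, 0, 0, 0, 0],
})

# Flag indicator columns whose count of positives falls below a chosen threshold.
threshold = 2  # illustrative cutoff
rare_cols = [c for c in toy.columns if toy[c].sum() < threshold]
print(rare_cols)  # ['Rare', 'AllZero']
```

Columns flagged this way carry little signal for a classifier, which motivates dropping them before modeling.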
data.head()
| EventId | Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Teamwork | Mental Awareness | State of Mind | Physical Problems | Sensory Misperception | Physical Environment | ... | Climate/ Culture Influences | Policy & Process Issues | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20060629X00856 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 1 | 20060113X00068 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 4 | 1 |
| 2 | 20060929X01431 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 3 | 20070109X00026 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 1 |
| 4 | 20060719X00965 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
5 rows × 26 columns
data.drop(['EventId', 'Teamwork', 'Mental Awareness', 'State of Mind', 'Physical Problems', 'Sensory Misperception',
           'Technological Environment', 'Planned Inappropriate Operations', 'Supervisory Violations',
           'Resource Problems', 'Personnel Selection & Staffing', 'Climate/ Culture Influences', 'Policy & Process Issues'],
          axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 477 entries, 0 to 478
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Performance-Based Errors           477 non-null    int64
 1   Judgment & Decision-Making Errors  477 non-null    int64
 2   Violations                         477 non-null    int64
 3   Physical Environment               477 non-null    int64
 4   Inadequate Supervision             477 non-null    int64
 5   Technology Failure                 477 non-null    int64
 6   Acts                               477 non-null    int64
 7   Preconditions                      477 non-null    int64
 8   Supervision                        477 non-null    int64
 9   Organization                       477 non-null    int64
 10  Fatal or Serious                   477 non-null    int64
 11  Flight Segment 1=Taxi              477 non-null    int64
 12  91=1/121=0                         477 non-null    int64
dtypes: int64(13)
memory usage: 52.2 KB
data.head()
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Fatal or Serious | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 4 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 1 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
sns.pairplot(data , hue='Fatal or Serious', palette='Set1')
<seaborn.axisgrid.PairGrid at 0x262e9d69640>
plt.figure(figsize=(12,8))
sns.heatmap(data.corr(),cmap='coolwarm')
<AxesSubplot:>
# analyze the dataset with Sweetviz
advert_report = sv.analyze([data,'Accident'],target_feat='Fatal or Serious',pairwise_analysis="on")
# display the report
advert_report.show_notebook(w='100%')
my_report = sv.compare_intra(data, data['Fatal or Serious'] == 1, ["Fatal or Serious", "Not Fatal or Serious"])  # names label the True/False halves of the condition
my_report.show_notebook(w='100%')
Train Dataset:
The set of data used for learning, i.e., to fit the parameters of the machine learning model.
Valid Dataset:
The set of data used to provide an unbiased evaluation of a model fitted on the training set while tuning its hyperparameters. It also plays a role in other forms of model preparation, such as feature selection and threshold selection.
Test Dataset:
The set of data used to provide an unbiased evaluation of the final model fitted on the training set.
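If fast_ml is not available, the same 70/10/20 split can be built from two calls to scikit-learn's `train_test_split`. A sketch on toy data (the second call's `test_size=2/3` carves the 30 % holdout into 10 % validation and 20 % test):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the accident data.
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=100), 'y': rng.integers(0, 2, size=100)})
X, y = df[['x']], df['y']

# First split off 70 % for training.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, train_size=0.7, random_state=101)
# Split the remaining 30 % into 1/3 validation (10 % overall) and 2/3 test (20 % overall).
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=2/3, random_state=101)
print(len(X_tr), len(X_va), len(X_te))  # 70 10 20
```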
from fast_ml.model_development import train_valid_test_split
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(data, target = 'Fatal or Serious',
train_size=0.7, valid_size=0.1, test_size=0.2,
random_state=101)
print(X_train.shape), print(y_train.shape)
print(X_valid.shape), print(y_valid.shape)
print(X_test.shape), print(y_test.shape)
(333, 12) (333,) (48, 12) (48,) (96, 12) (96,)
(None, None)
X_train
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 429 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 333 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 109 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 93 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 5 | 0 |
| 229 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 63 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 7 | 1 |
| 328 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 1 |
| 339 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 11 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 353 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 5 | 0 |
333 rows × 12 columns
y_train
429 1
333 1
109 0
93 0
229 0
..
63 1
328 0
339 0
11 0
353 0
Name: Fatal or Serious, Length: 333, dtype: int64
X_valid
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 |
| 440 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 143 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 226 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 |
| 268 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 | 1 |
| 278 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 0 |
| 138 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 |
| 31 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 1 |
| 91 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 |
| 406 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 100 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 |
| 128 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1 |
| 315 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 104 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 1 |
| 212 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 7 | 0 |
| 169 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 190 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 260 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 438 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 391 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 418 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 6 | 0 |
| 133 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 0 |
| 317 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 7 | 1 |
| 334 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 1 |
| 117 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 238 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 7 | 1 |
| 454 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 0 |
| 213 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 0 |
| 130 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 7 | 1 |
| 346 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 349 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 180 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 210 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 0 |
| 419 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 1 |
| 17 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 6 | 1 |
| 184 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 5 | 1 |
| 351 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 478 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 6 | 0 |
| 408 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 1 |
| 13 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 211 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 |
| 38 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 1 |
| 72 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 387 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 179 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 1 |
| 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 |
| 412 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 0 |
y_valid
114    1
440    0
143    1
226    0
268    0
278    0
31     0
Name: Fatal or Serious, dtype: int64
(entries continue in index order: 138 0, 91 0, 406 0, 100 1, 128 0, 315 1, 104 1, 212 0, 169 1, 190 1, 260 0, 438 0, 391 0, 418 1, 133 0, 317 0, 334 1, 117 0, 238 0, 454 1, 213 0, 130 0, 346 0, 349 0, 180 0, 210 0, 419 0, 17 1, 184 1, 351 0, 478 0, 408 1, 13 0, 211 1, 38 1, 72 1, 2 0, 387 0, 179 0, 308 1, 412 0)
X_test
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 455 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 0 |
| 399 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 166 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 6 | 1 |
| 188 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 7 | 1 |
| 53 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 234 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 5 | 0 |
| 8 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 1 |
| 468 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 335 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 1 |
96 rows × 12 columns
y_test
455 0
399 0
166 1
188 0
53 0
..
4 0
234 0
8 0
468 0
335 1
Name: Fatal or Serious, Length: 96, dtype: int64
X_train_imbalance = X_train.copy()
y_train_imbalance = y_train.copy()
from imblearn.over_sampling import SMOTE
from collections import Counter
oversample = SMOTE()
X_train,y_train = oversample.fit_resample(X_train,y_train)
counter = Counter(y_train)
counter
Counter({1: 206, 0: 206})
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)
LogisticRegression(random_state=101)
pred = lr.predict(X_test)
print('Score:\n',lr.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred))
Score:
0.7395833333333334
Confusion Matrix:
[[53 17]
[ 8 18]]
Classification Report:
precision recall f1-score support
0 0.87 0.76 0.81 70
1 0.51 0.69 0.59 26
accuracy 0.74 96
macro avg 0.69 0.72 0.70 96
weighted avg 0.77 0.74 0.75 96
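The report's per-class numbers follow directly from the confusion matrix above; a quick sanity check, with the values reproduced from the printed matrix (rows = true class, columns = predicted class):

```python
import numpy as np

# Confusion matrix printed above for the logistic regression on the test set.
cm = np.array([[53, 17],
               [ 8, 18]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()   # (53 + 18) / 96
precision_1 = tp / (tp + fp)      # 18 / 35
recall_1 = tp / (tp + fn)         # 18 / 26
print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2))  # 0.74 0.51 0.69
```

These match the accuracy and the class-1 precision/recall in the classification report.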
mat_T = confusion_matrix(y_test,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')  # confusion_matrix rows are true labels, columns are predictions
plt.ylabel('true label');
df_eval = pd.DataFrame({'Label':y_test,
'Prediction': pred})
df_eval
| Label | Prediction | |
|---|---|---|
| 455 | 0 | 0 |
| 399 | 0 | 0 |
| 166 | 1 | 0 |
| 188 | 0 | 1 |
| 53 | 0 | 0 |
| ... | ... | ... |
| 4 | 0 | 0 |
| 234 | 0 | 1 |
| 8 | 0 | 1 |
| 468 | 0 | 0 |
| 335 | 1 | 1 |
96 rows × 2 columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data.drop(['Fatal or Serious'],axis=1))
StandardScaler()
scaled_features = scaler.transform(data.drop(['Fatal or Serious'],axis=1))
df_feat = pd.DataFrame(scaled_features, columns=data.columns.drop('Fatal or Serious'))  # data.columns[:-1] would mislabel the columns: it keeps 'Fatal or Serious' and drops '91=1/121=0'
df_feat
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.045038 | -0.382929 | -0.277037 | -0.442141 | -0.214599 | -0.687184 | 0.754373 | -0.582999 | -0.219890 | -0.22508 | 1.012065 | 0.683880 |
| 1 | -0.956903 | 2.611450 | -0.277037 | -0.442141 | -0.214599 | -0.687184 | 0.754373 | -0.582999 | -0.219890 | -0.22508 | -0.467266 | 0.683880 |
| 2 | 1.045038 | -0.382929 | -0.277037 | -0.442141 | -0.214599 | -0.687184 | 0.754373 | -0.582999 | -0.219890 | -0.22508 | 1.012065 | 0.683880 |
| 3 | -0.956903 | -0.382929 | -0.277037 | -0.442141 | -0.214599 | 1.455214 | -1.325604 | -0.582999 | -0.219890 | -0.22508 | 1.012065 | 0.683880 |
| 4 | 1.045038 | -0.382929 | -0.277037 | -0.442141 | -0.214599 | -0.687184 | 0.754373 | -0.582999 | -0.219890 | -0.22508 | 1.012065 | 0.683880 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472 | -0.956903 | -0.382929 | 3.609628 | -0.442141 | -0.214599 | -0.687184 | 0.754373 | -0.582999 | -0.219890 | -0.22508 | 0.025844 | -1.462244 |
| 473 | -0.956903 | -0.382929 | -0.277037 | -0.442141 | -0.214599 | -0.687184 | -1.325604 | -0.582999 | -0.219890 | -0.22508 | -0.467266 | -1.462244 |
| 474 | -0.956903 | -0.382929 | -0.277037 | 2.261722 | -0.214599 | -0.687184 | -1.325604 | 1.715269 | -0.219890 | -0.22508 | 0.025844 | -1.462244 |
| 475 | -0.956903 | -0.382929 | -0.277037 | 2.261722 | -0.214599 | -0.687184 | -1.325604 | 1.715269 | -0.219890 | -0.22508 | 0.025844 | -1.462244 |
| 476 | -0.956903 | -0.382929 | -0.277037 | -0.442141 | 4.659859 | 1.455214 | -1.325604 | -0.582999 | 4.547727 | -0.22508 | 0.518955 | -1.462244 |
477 rows × 12 columns
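Fitting the scaler on the entire dataset lets validation- and test-set statistics leak into preprocessing. Wrapping the scaler and model in a `Pipeline` fitted only on training data avoids this; a sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the accident features; the label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=101)

# The pipeline fits the scaler on X_tr only, then applies those statistics to X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(acc > 0.5)  # better than chance on this mostly separable toy problem
```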
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_valid)
print('Score:\n',knn.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.6666666666666666
Confusion Matrix:
[[22 9]
[ 7 10]]
Classification Report:
precision recall f1-score support
0 0.76 0.71 0.73 31
1 0.53 0.59 0.56 17
accuracy 0.67 48
macro avg 0.64 0.65 0.64 48
weighted avg 0.68 0.67 0.67 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')  # confusion_matrix rows are true labels, columns are predictions
plt.ylabel('true label');
error_rate = []
for i in range(1,40):
knn = KNeighborsClassifier(n_neighbors= i)
knn.fit(X_train,y_train)
pred_i = knn.predict(X_valid)
error_rate.append(np.mean(pred_i != y_valid))
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue',ls='--',marker='o',markerfacecolor='red',markersize=10)
plt.title('ERROR RATE VS. K VALUES')
plt.xlabel('K')
plt.ylabel('ERROR RATE')
Text(0, 0.5, 'ERROR RATE')
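Reading the best k off the plot can also be automated: pick the smallest k attaining the minimum validation error. A sketch using toy error values in place of the `error_rate` list built above:

```python
import numpy as np

# Toy error curve standing in for the error_rate list from the loop above.
error_rate = [0.33, 0.31, 0.29, 0.27, 0.29, 0.27, 0.28, 0.30, 0.27]

# k values start at 1, so add 1 to the argmin index; ties resolve to the smallest k.
best_k = int(np.argmin(error_rate)) + 1
print(best_k)  # 4
```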
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train,y_train)
pred = knn.predict(X_valid)
print('Score:\n',knn.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7083333333333334
Confusion Matrix:
[[21 10]
[ 4 13]]
Classification Report:
precision recall f1-score support
0 0.84 0.68 0.75 31
1 0.57 0.76 0.65 17
accuracy 0.71 48
macro avg 0.70 0.72 0.70 48
weighted avg 0.74 0.71 0.71 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')  # confusion_matrix rows are true labels, columns are predictions
plt.ylabel('true label');
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=101)
dt.fit(X_train,y_train)
DecisionTreeClassifier(random_state=101)
pred = dt.predict(X_valid)
print('Score:\n',dt.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7291666666666666
Confusion Matrix:
[[24 7]
[ 6 11]]
Classification Report:
precision recall f1-score support
0 0.80 0.77 0.79 31
1 0.61 0.65 0.63 17
accuracy 0.73 48
macro avg 0.71 0.71 0.71 48
weighted avg 0.73 0.73 0.73 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
df_eval = pd.DataFrame({'Label':y_valid,
'Prediction': pred})
df_eval
| Label | Prediction | |
|---|---|---|
| 114 | 1 | 1 |
| 440 | 0 | 1 |
| 143 | 1 | 0 |
| 226 | 0 | 0 |
| 268 | 0 | 1 |
| 278 | 0 | 0 |
| 138 | 0 | 1 |
| 31 | 0 | 0 |
| 91 | 0 | 0 |
| 406 | 0 | 0 |
| 100 | 1 | 1 |
| 128 | 0 | 1 |
| 315 | 1 | 1 |
| 104 | 1 | 0 |
| 212 | 0 | 0 |
| 169 | 1 | 0 |
| 190 | 1 | 1 |
| 260 | 0 | 0 |
| 438 | 0 | 0 |
| 391 | 0 | 0 |
| 418 | 1 | 1 |
| 133 | 0 | 0 |
| 317 | 0 | 0 |
| 334 | 1 | 1 |
| 117 | 0 | 0 |
| 238 | 0 | 0 |
| 454 | 1 | 1 |
| 213 | 0 | 0 |
| 130 | 0 | 0 |
| 346 | 0 | 0 |
| 349 | 0 | 0 |
| 180 | 0 | 0 |
| 210 | 0 | 1 |
| 419 | 0 | 1 |
| 17 | 1 | 0 |
| 184 | 1 | 0 |
| 351 | 0 | 0 |
| 478 | 0 | 0 |
| 408 | 1 | 1 |
| 13 | 0 | 1 |
| 211 | 1 | 1 |
| 38 | 1 | 1 |
| 72 | 1 | 1 |
| 2 | 0 | 0 |
| 387 | 0 | 0 |
| 179 | 0 | 0 |
| 308 | 1 | 0 |
| 412 | 0 | 0 |
from sklearn.tree import plot_tree
plt.figure(figsize=(50,40))
plot_tree(dt, filled=True)
plt.title("Decision Tree")
plt.savefig('Decision Tree Visualize.jpg')
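At this depth the saved tree image is hard to read; `sklearn.tree.export_text` prints the same split rules as plain text. A sketch on toy binary data — the feature names below are illustrative, not the notebook's full column set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# toy 0/1 features resembling the HFACS-style columns
rng = np.random.RandomState(101)
X = rng.randint(0, 2, size=(100, 3))
y = (X[:, 0] & X[:, 1]).astype(int)  # label depends on two features

dt_toy = DecisionTreeClassifier(random_state=101).fit(X, y)
rules = export_text(dt_toy, feature_names=['Acts', 'Violations', 'Supervision'])
print(rules)
```

Passing `X_train.columns` as `feature_names` on the real model would print the fitted tree's rules directly.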
pd.DataFrame(data=dt.feature_importances_,index=X_train.columns,columns=['Value']).sort_values(by='Value',ascending=False)
| Value | |
|---|---|
| Flight Segment 1=Taxi | 0.499536 |
| 91=1/121=0 | 0.142917 |
| Preconditions | 0.056012 |
| Physical Environment | 0.052905 |
| Performance-Based Errors | 0.052388 |
| Technology Failure | 0.052089 |
| Supervision | 0.044917 |
| Violations | 0.036851 |
| Judgment & Decision-Making Errors | 0.029812 |
| Acts | 0.023049 |
| Organization | 0.009523 |
| Inadequate Supervision | 0.000000 |
param_grid = {'criterion':['gini','entropy'],'splitter':['best', 'random']}
from sklearn.model_selection import GridSearchCV
grid1 = GridSearchCV(DecisionTreeClassifier(random_state=101),param_grid,refit=True,verbose=3)
grid1.fit(X_train,y_train)
pred = grid1.predict(X_valid)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV 1/5] END .....criterion=gini, splitter=best;, score=0.783 total time=   0.0s
[CV 2/5] END .....criterion=gini, splitter=best;, score=0.675 total time=   0.0s
[CV 3/5] END .....criterion=gini, splitter=best;, score=0.793 total time=   0.0s
[CV 4/5] END .....criterion=gini, splitter=best;, score=0.780 total time=   0.0s
[CV 5/5] END .....criterion=gini, splitter=best;, score=0.744 total time=   0.0s
[CV 1/5] END ...criterion=gini, splitter=random;, score=0.783 total time=   0.0s
[CV 2/5] END ...criterion=gini, splitter=random;, score=0.687 total time=   0.0s
[CV 3/5] END ...criterion=gini, splitter=random;, score=0.817 total time=   0.0s
[CV 4/5] END ...criterion=gini, splitter=random;, score=0.793 total time=   0.0s
[CV 5/5] END ...criterion=gini, splitter=random;, score=0.744 total time=   0.0s
[CV 1/5] END ..criterion=entropy, splitter=best;, score=0.783 total time=   0.0s
[CV 2/5] END ..criterion=entropy, splitter=best;, score=0.687 total time=   0.0s
[CV 3/5] END ..criterion=entropy, splitter=best;, score=0.793 total time=   0.0s
[CV 4/5] END ..criterion=entropy, splitter=best;, score=0.780 total time=   0.0s
[CV 5/5] END ..criterion=entropy, splitter=best;, score=0.744 total time=   0.0s
[CV 1/5] END criterion=entropy, splitter=random;, score=0.819 total time=   0.0s
[CV 2/5] END criterion=entropy, splitter=random;, score=0.687 total time=   0.0s
[CV 3/5] END criterion=entropy, splitter=random;, score=0.793 total time=   0.0s
[CV 4/5] END criterion=entropy, splitter=random;, score=0.793 total time=   0.0s
[CV 5/5] END criterion=entropy, splitter=random;, score=0.732 total time=   0.0s
grid1.best_params_
{'criterion': 'gini', 'splitter': 'random'}
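Beyond `best_params_`, the full grid results are available in `cv_results_`, which is easiest to inspect as a DataFrame. A sketch on synthetic data (the column subset shown is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# synthetic binary features standing in for X_train / y_train
rng = np.random.RandomState(101)
X = rng.randint(0, 2, size=(120, 4))
y = rng.randint(0, 2, size=120)

grid = GridSearchCV(DecisionTreeClassifier(random_state=101),
                    {'criterion': ['gini', 'entropy'],
                     'splitter': ['best', 'random']},
                    cv=5)
grid.fit(X, y)

# one row per parameter combination, ranked by mean CV score
results = pd.DataFrame(grid.cv_results_)[
    ['param_criterion', 'param_splitter',
     'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(results)
```

This makes it easy to see how close the runner-up settings were, which the verbose log above only shows fold by fold.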
print('Score:\n',grid1.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7708333333333334
Confusion Matrix:
[[24 7]
[ 4 13]]
Classification Report:
precision recall f1-score support
0 0.86 0.77 0.81 31
1 0.65 0.76 0.70 17
accuracy 0.77 48
macro avg 0.75 0.77 0.76 48
weighted avg 0.78 0.77 0.77 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100,random_state=101)
rf.fit(X_train,y_train)
RandomForestClassifier(random_state=101)
pred_rf = rf.predict(X_valid)
df_eval = pd.DataFrame({'Label':y_valid,
'Prediction': pred_rf
})
df_eval
| Label | Prediction | |
|---|---|---|
| 114 | 1 | 1 |
| 440 | 0 | 1 |
| 143 | 1 | 1 |
| 226 | 0 | 0 |
| 268 | 0 | 1 |
| 278 | 0 | 0 |
| 138 | 0 | 1 |
| 31 | 0 | 0 |
| 91 | 0 | 0 |
| 406 | 0 | 0 |
| 100 | 1 | 1 |
| 128 | 0 | 1 |
| 315 | 1 | 1 |
| 104 | 1 | 0 |
| 212 | 0 | 0 |
| 169 | 1 | 0 |
| 190 | 1 | 1 |
| 260 | 0 | 0 |
| 438 | 0 | 0 |
| 391 | 0 | 0 |
| 418 | 1 | 1 |
| 133 | 0 | 0 |
| 317 | 0 | 0 |
| 334 | 1 | 1 |
| 117 | 0 | 0 |
| 238 | 0 | 0 |
| 454 | 1 | 1 |
| 213 | 0 | 1 |
| 130 | 0 | 0 |
| 346 | 0 | 0 |
| 349 | 0 | 0 |
| 180 | 0 | 0 |
| 210 | 0 | 1 |
| 419 | 0 | 0 |
| 17 | 1 | 0 |
| 184 | 1 | 0 |
| 351 | 0 | 0 |
| 478 | 0 | 0 |
| 408 | 1 | 1 |
| 13 | 0 | 1 |
| 211 | 1 | 1 |
| 38 | 1 | 1 |
| 72 | 1 | 1 |
| 2 | 0 | 0 |
| 387 | 0 | 0 |
| 179 | 0 | 0 |
| 308 | 1 | 0 |
| 412 | 0 | 0 |
print('Score:\n',rf.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred_rf),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred_rf))
Score:
0.75
Confusion Matrix:
[[24 7]
[ 5 12]]
Classification Report:
precision recall f1-score support
0 0.83 0.77 0.80 31
1 0.63 0.71 0.67 17
accuracy 0.75 48
macro avg 0.73 0.74 0.73 48
weighted avg 0.76 0.75 0.75 48
mat_T = confusion_matrix(y_valid,pred_rf)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
pd.DataFrame(data=rf.feature_importances_,index=X_train.columns,columns=['Value']).sort_values(by='Value',ascending=False)
| Value | |
|---|---|
| Flight Segment 1=Taxi | 0.482800 |
| 91=1/121=0 | 0.098460 |
| Technology Failure | 0.075286 |
| Preconditions | 0.063597 |
| Performance-Based Errors | 0.061734 |
| Violations | 0.051583 |
| Physical Environment | 0.046682 |
| Acts | 0.041104 |
| Judgment & Decision-Making Errors | 0.033210 |
| Supervision | 0.016857 |
| Organization | 0.014778 |
| Inadequate Supervision | 0.013907 |
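Impurity-based importances can be biased toward features with many distinct values, such as the 7-level `Flight Segment 1=Taxi` column that dominates both rankings above. `permutation_importance` instead scores each feature by the accuracy drop when its column is shuffled, which is often a fairer check. A sketch on synthetic binary features (not the notebook's actual columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# features 0 and 2 are informative, 1 and 3 are noise
rng = np.random.RandomState(101)
X = rng.randint(0, 2, size=(200, 4))
y = (X[:, 0] | X[:, 2]).astype(int)

rf_toy = RandomForestClassifier(n_estimators=100, random_state=101).fit(X, y)
result = permutation_importance(rf_toy, X, y, n_repeats=10, random_state=101)
print(result.importances_mean)  # noise columns score near zero
```

On the real model, running this with `X_valid` / `y_valid` would show whether the flight-segment column truly carries half the signal or is merely easy to split on.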
from sklearn.svm import SVC
svm = SVC(random_state=101)
svm.fit(X_train,y_train)
SVC(random_state=101)
pred = svm.predict(X_valid)
df_eval = pd.DataFrame({'Label':y_valid,
'Prediction': pred
})
df_eval
| Label | Prediction | |
|---|---|---|
| 114 | 1 | 1 |
| 440 | 0 | 1 |
| 143 | 1 | 0 |
| 226 | 0 | 0 |
| 268 | 0 | 1 |
| 278 | 0 | 0 |
| 138 | 0 | 1 |
| 31 | 0 | 1 |
| 91 | 0 | 0 |
| 406 | 0 | 0 |
| 100 | 1 | 1 |
| 128 | 0 | 1 |
| 315 | 1 | 1 |
| 104 | 1 | 1 |
| 212 | 0 | 0 |
| 169 | 1 | 1 |
| 190 | 1 | 1 |
| 260 | 0 | 0 |
| 438 | 0 | 0 |
| 391 | 0 | 0 |
| 418 | 1 | 1 |
| 133 | 0 | 0 |
| 317 | 0 | 0 |
| 334 | 1 | 1 |
| 117 | 0 | 0 |
| 238 | 0 | 0 |
| 454 | 1 | 1 |
| 213 | 0 | 1 |
| 130 | 0 | 0 |
| 346 | 0 | 0 |
| 349 | 0 | 1 |
| 180 | 0 | 0 |
| 210 | 0 | 1 |
| 419 | 0 | 0 |
| 17 | 1 | 1 |
| 184 | 1 | 1 |
| 351 | 0 | 0 |
| 478 | 0 | 0 |
| 408 | 1 | 1 |
| 13 | 0 | 1 |
| 211 | 1 | 1 |
| 38 | 1 | 1 |
| 72 | 1 | 1 |
| 2 | 0 | 0 |
| 387 | 0 | 0 |
| 179 | 0 | 1 |
| 308 | 1 | 1 |
| 412 | 0 | 0 |
print('Score:\n',svm.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7708333333333334
Confusion Matrix:
[[21 10]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.95 0.68 0.79 31
1 0.62 0.94 0.74 17
accuracy 0.77 48
macro avg 0.78 0.81 0.77 48
weighted avg 0.83 0.77 0.78 48
df_eval[df_eval['Label'] != df_eval['Prediction']]
| Label | Prediction | |
|---|---|---|
| 440 | 0 | 1 |
| 143 | 1 | 0 |
| 268 | 0 | 1 |
| 138 | 0 | 1 |
| 31 | 0 | 1 |
| 128 | 0 | 1 |
| 213 | 0 | 1 |
| 349 | 0 | 1 |
| 210 | 0 | 1 |
| 13 | 0 | 1 |
| 179 | 0 | 1 |
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True,fmt='d',cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
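RBF-kernel SVC is sensitive to feature scale; most features here are 0/1, but the flight-segment column ranges over 1–7, so a `StandardScaler` pipeline may be worth trying before the SVC. A hedged sketch on synthetic data, not the notebook's actual preprocessing:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# three 0/1 columns plus one 1-7 column, mimicking the feature mix
rng = np.random.RandomState(101)
X = np.hstack([rng.randint(0, 2, size=(150, 3)),
               rng.randint(1, 8, size=(150, 1))])
y = rng.randint(0, 2, size=150)

# scaling happens inside the pipeline, so CV splits stay leak-free
svm_scaled = make_pipeline(StandardScaler(), SVC(random_state=101))
svm_scaled.fit(X, y)
print('train accuracy:', svm_scaled.score(X, y))
```

Whether scaling helps here is an empirical question; comparing validation scores with and without the scaler would settle it.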
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
pred = bnb.predict(X_valid)
print("Number of mislabeled points out of a total %d points : %d" % (X_valid.shape[0], (y_valid != pred).sum()))
Number of mislabeled points out of a total 48 points : 19
print('Score:\n',bnb.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.6041666666666666
Confusion Matrix:
[[22 9]
[10 7]]
Classification Report:
precision recall f1-score support
0 0.69 0.71 0.70 31
1 0.44 0.41 0.42 17
accuracy 0.60 48
macro avg 0.56 0.56 0.56 48
weighted avg 0.60 0.60 0.60 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
from sklearn.ensemble import BaggingClassifier
bca = BaggingClassifier(SVC(random_state=101),random_state=101)
bca.fit(X_train, y_train)
BaggingClassifier(estimator=SVC(random_state=101), random_state=101)
pred = bca.predict(X_valid)
print('Score:\n',bca.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7708333333333334
Confusion Matrix:
[[21 10]
[ 1 16]]
Classification Report:
precision recall f1-score support
0 0.95 0.68 0.79 31
1 0.62 0.94 0.74 17
accuracy 0.77 48
macro avg 0.78 0.81 0.77 48
weighted avg 0.83 0.77 0.78 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=101)
gbc.fit(X_train, y_train)
GradientBoostingClassifier(random_state=101)
pred = gbc.predict(X_valid)
print('Score:\n',gbc.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7916666666666666
Confusion Matrix:
[[25 6]
[ 4 13]]
Classification Report:
precision recall f1-score support
0 0.86 0.81 0.83 31
1 0.68 0.76 0.72 17
accuracy 0.79 48
macro avg 0.77 0.79 0.78 48
weighted avg 0.80 0.79 0.79 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
pred = xgb.predict(X_valid)
predictions = [round(value) for value in pred]
accuracy = accuracy_score(y_valid, predictions)
print("Accuracy: %.3f%%" % (accuracy * 100.0))
Accuracy: 77.083%
print('Score:\n',xgb.score(X_valid,y_valid),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_valid,pred),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_valid,pred))
Score:
0.7708333333333334
Confusion Matrix:
[[25 6]
[ 5 12]]
Classification Report:
precision recall f1-score support
0 0.83 0.81 0.82 31
1 0.67 0.71 0.69 17
accuracy 0.77 48
macro avg 0.75 0.76 0.75 48
weighted avg 0.77 0.77 0.77 48
mat_T = confusion_matrix(y_valid,pred)
sns.heatmap(mat_T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted label')
plt.ylabel('true label');
from sklearn.ensemble import VotingClassifier
model_1 = SVC(random_state=101)
model_1.fit(X_train, y_train)
pred_1 = model_1.predict(X_valid)
model_2 = GradientBoostingClassifier(random_state=101)
model_2.fit(X_train, y_train)
pred_2 = model_2.predict(X_valid)
model_3 = RandomForestClassifier(n_estimators=100,random_state=101)
model_3.fit(X_train, y_train)
pred_3 = model_3.predict(X_valid)
eclf = VotingClassifier(estimators=[('Support Vector Machine', model_1),
('Gradient Boosting', model_2),
('Random Forest', model_3)]
,voting='hard')
from sklearn.model_selection import cross_val_score
for clf, label in zip([model_1, model_2, model_3, eclf], ['Support Vector Machine', 'Gradient Boosting' , 'Random Forest',
'Ensemble']):
scores = cross_val_score(clf, X_valid, y_valid, scoring='accuracy', cv=5)
print("Accuracy: %0.3f (+/- %0.3f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.607 (+/- 0.108) [Support Vector Machine]
Accuracy: 0.631 (+/- 0.199) [Gradient Boosting]
Accuracy: 0.753 (+/- 0.147) [Random Forest]
Accuracy: 0.733 (+/- 0.159) [Ensemble]
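`voting='hard'` counts majority class votes; `voting='soft'` averages predicted probabilities instead, which often edges out hard voting when the base models are reasonably calibrated. Note that `SVC` must be built with `probability=True` to take part. A sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.svm import SVC

# synthetic stand-in for the notebook's training data
rng = np.random.RandomState(101)
X = rng.rand(120, 5)
y = (X[:, 0] > 0.5).astype(int)

soft = VotingClassifier(
    estimators=[('svm', SVC(probability=True, random_state=101)),
                ('gb', GradientBoostingClassifier(random_state=101)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=101))],
    voting='soft')  # averages class probabilities across the three models
soft.fit(X, y)
print('train accuracy:', soft.score(X, y))
```

Swapping `voting='soft'` into the `eclf` above would be a one-line experiment against the hard-voting numbers.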
X_trainValid = pd.concat([X_train,X_valid],axis=0)
X_trainValid
| Performance-Based Errors | Judgment & Decision-Making Errors | Violations | Physical Environment | Inadequate Supervision | Technology Failure | Acts | Preconditions | Supervision | Organization | Flight Segment 1=Taxi | 91=1/121=0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 5 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 1 |
| 387 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 7 | 1 |
| 179 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 1 |
| 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | 0 |
| 412 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7 | 0 |
460 rows × 12 columns
y_trainValid = pd.concat([y_train,y_valid],axis=0)
y_trainValid
0 1
1 1
2 0
3 0
4 0
..
2 0
387 0
179 0
308 1
412 0
Name: Fatal or Serious, Length: 460, dtype: int64
from sklearn.linear_model import LogisticRegression
model_0 = LogisticRegression(random_state=101)
model_0.fit(X_trainValid, y_trainValid)
LogisticRegression(random_state=101)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_0, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.63043478 0.58695652 0.63043478 0.73188406 0.67391304]
0.651 accuracy with a standard deviation of 0.049
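With roughly a 2:1 class imbalance in this dataset, `StratifiedShuffleSplit` keeps the fatal/non-fatal ratio the same in every fold, which plain `ShuffleSplit` does not guarantee. A sketch on synthetic imbalanced data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# synthetic data with ~2:1 imbalance, mimicking Fatal or Serious
rng = np.random.RandomState(101)
X = rng.rand(300, 4)
y = (rng.rand(300) < 0.33).astype(int)

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(LogisticRegression(random_state=101), X, y, cv=cv)
print(scores.mean(), scores.std())
```

Replacing the `ShuffleSplit` objects above with this is a drop-in change and tends to reduce fold-to-fold variance on imbalanced targets.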
pred_0 = model_0.predict(X_test)
print('Score:\n',model_0.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_0),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_0))
Score:
0.7604166666666666
Confusion Matrix:
[[56 14]
[ 9 17]]
Classification Report:
precision recall f1-score support
0 0.86 0.80 0.83 70
1 0.55 0.65 0.60 26
accuracy 0.76 96
macro avg 0.70 0.73 0.71 96
weighted avg 0.78 0.76 0.77 96
model_1 = RandomForestClassifier(n_estimators=100,random_state=101)
model_1.fit(X_trainValid, y_trainValid)
RandomForestClassifier(random_state=101)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_1, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.73913043 0.77536232 0.7173913  0.73913043 0.7173913 ]
0.738 accuracy with a standard deviation of 0.021
pred_1 = model_1.predict(X_test)
print('Score:\n',model_1.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_1),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_1))
Score:
0.8125
Confusion Matrix:
[[56 14]
[ 4 22]]
Classification Report:
precision recall f1-score support
0 0.93 0.80 0.86 70
1 0.61 0.85 0.71 26
accuracy 0.81 96
macro avg 0.77 0.82 0.79 96
weighted avg 0.85 0.81 0.82 96
model_2 = XGBClassifier()
model_2.fit(X_trainValid, y_trainValid)
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_2, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.76086957 0.76086957 0.71014493 0.73913043 0.73188406]
0.741 accuracy with a standard deviation of 0.019
pred_2 = model_2.predict(X_test)
print('Score:\n',model_2.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_2),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_2))
Score:
0.8020833333333334
Confusion Matrix:
[[56 14]
[ 5 21]]
Classification Report:
precision recall f1-score support
0 0.92 0.80 0.85 70
1 0.60 0.81 0.69 26
accuracy 0.80 96
macro avg 0.76 0.80 0.77 96
weighted avg 0.83 0.80 0.81 96
model_3 = GradientBoostingClassifier(random_state=101)
model_3.fit(X_trainValid, y_trainValid)
GradientBoostingClassifier(random_state=101)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_3, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.73913043 0.77536232 0.71014493 0.73913043 0.73913043]
0.741 accuracy with a standard deviation of 0.021
pred_3 = model_3.predict(X_test)
print('Score:\n',model_3.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_3),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_3))
Score:
0.8020833333333334
Confusion Matrix:
[[57 13]
[ 6 20]]
Classification Report:
precision recall f1-score support
0 0.90 0.81 0.86 70
1 0.61 0.77 0.68 26
accuracy 0.80 96
macro avg 0.76 0.79 0.77 96
weighted avg 0.82 0.80 0.81 96
model_4 = BaggingClassifier()
model_4.fit(X_trainValid, y_trainValid)
BaggingClassifier()
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_4, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.75362319 0.77536232 0.7173913  0.73913043 0.69565217]
0.736 accuracy with a standard deviation of 0.028
pred_4 = model_4.predict(X_test)
print('Score:\n',model_4.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_4),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_4))
Score:
0.7395833333333334
Confusion Matrix:
[[54 16]
[ 9 17]]
Classification Report:
precision recall f1-score support
0 0.86 0.77 0.81 70
1 0.52 0.65 0.58 26
accuracy 0.74 96
macro avg 0.69 0.71 0.69 96
weighted avg 0.76 0.74 0.75 96
model_5 = SVC(random_state=101)
model_5.fit(X_trainValid, y_trainValid)
SVC(random_state=101)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_5, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.77536232 0.75362319 0.7173913  0.70289855 0.74637681]
0.739 accuracy with a standard deviation of 0.026
pred_5 = model_5.predict(X_test)
print('Score:\n',model_5.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_5),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_5))
Score:
0.6979166666666666
Confusion Matrix:
[[49 21]
[ 8 18]]
Classification Report:
precision recall f1-score support
0 0.86 0.70 0.77 70
1 0.46 0.69 0.55 26
accuracy 0.70 96
macro avg 0.66 0.70 0.66 96
weighted avg 0.75 0.70 0.71 96
model_6 = DecisionTreeClassifier(random_state=101)
model_6.fit(X_trainValid, y_trainValid)
DecisionTreeClassifier(random_state=101)
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_6, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.71014493 0.75362319 0.6884058  0.73188406 0.68115942]
0.713 accuracy with a standard deviation of 0.027
pred_6 = model_6.predict(X_test)
print('Score:\n',model_6.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_6),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_6))
Score:
0.8020833333333334
Confusion Matrix:
[[56 14]
[ 5 21]]
Classification Report:
precision recall f1-score support
0 0.92 0.80 0.85 70
1 0.60 0.81 0.69 26
accuracy 0.80 96
macro avg 0.76 0.80 0.77 96
weighted avg 0.83 0.80 0.81 96
model_7 = BernoulliNB()
model_7.fit(X_trainValid, y_trainValid)
BernoulliNB()
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_7, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.60869565 0.57246377 0.65942029 0.66666667 0.5942029 ]
0.620 accuracy with a standard deviation of 0.037
pred_7 = model_7.predict(X_test)
print('Score:\n',model_7.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_7),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_7))
Score:
0.7708333333333334
Confusion Matrix:
[[59 11]
[11 15]]
Classification Report:
precision recall f1-score support
0 0.84 0.84 0.84 70
1 0.58 0.58 0.58 26
accuracy 0.77 96
macro avg 0.71 0.71 0.71 96
weighted avg 0.77 0.77 0.77 96
eclf = VotingClassifier(estimators=[('XGboost Classifier', model_2),
('Gradient Boosting Classifier', model_3),
('Random Forest Classifier', model_1)]
,voting='hard')
eclf.fit(X_trainValid, y_trainValid)
VotingClassifier(estimators=[('XGboost Classifier',
XGBClassifier(base_score=None, booster=None,
callbacks=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
early_stopping_rounds=None,
enable_categorical=False,
eval_metric=None,
feature_types=None, gamma=None,
gpu_id=None, grow_policy=None,
importance_type=None,
interaction_constraints=No...
max_cat_to_onehot=None,
max_delta_step=None, max_depth=None,
max_leaves=None,
min_child_weight=None, missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
predictor=None, random_state=None, ...)),
('Gradient Boosting Classifier',
GradientBoostingClassifier(random_state=101)),
('Random Forest Classifier',
                             RandomForestClassifier(random_state=101))])
from sklearn.model_selection import ShuffleSplit
n_samples = X_trainValid.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(eclf, X_trainValid, y_trainValid, cv=cv)
print(scores)
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
[0.75362319 0.7826087  0.7173913  0.75362319 0.74637681]
0.751 accuracy with a standard deviation of 0.021
pred_8 = eclf.predict(X_test)
print('Score:\n',eclf.score(X_test,y_test),'\n')
from sklearn.metrics import confusion_matrix
print('Confusion Matrix:\n',confusion_matrix(y_test,pred_8),'\n')
from sklearn.metrics import classification_report
print('Classification Report:\n',classification_report(y_test,pred_8))
Score:
0.8020833333333334
Confusion Matrix:
[[56 14]
[ 5 21]]
Classification Report:
precision recall f1-score support
0 0.92 0.80 0.85 70
1 0.60 0.81 0.69 26
accuracy 0.80 96
macro avg 0.76 0.80 0.77 96
weighted avg 0.83 0.80 0.81 96
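With nine models now scored on the same test set, a small summary table is easier to compare than scrolling through the individual reports. The accuracies below are the ones printed above; the table itself is an added convenience, not original notebook output:

```python
import pandas as pd

# test-set accuracies collected from the evaluations above
summary = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'XGBoost',
              'Gradient Boosting', 'Bagging', 'SVC', 'Decision Tree',
              'BernoulliNB', 'Voting (hard)'],
    'Test accuracy': [0.7604, 0.8125, 0.8021, 0.8021, 0.7396,
                      0.6979, 0.8021, 0.7708, 0.8021],
}).sort_values('Test accuracy', ascending=False)
print(summary.to_string(index=False))
```

Random Forest leads on test accuracy (0.8125), with XGBoost, Gradient Boosting, the Decision Tree, and the hard-voting ensemble tied just behind at 0.8021.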
X = pd.concat([X_train,X_valid,X_test],axis=0)
y = pd.concat([y_train,y_valid,y_test],axis=0)
from sklearn.model_selection import cross_val_score
from sklearn import metrics
from sklearn.model_selection import ShuffleSplit
n_samples = X_train.shape[0]
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=101)
scores = cross_val_score(model_2, X, y, cv=cv)
scores
array([0.74251497, 0.73652695, 0.76047904, 0.76047904, 0.80838323])
for train_index, test_index in cv.split(X):
    print("%s %s" % (train_index, test_index))
[526 486  69 296 248  41  70 321 376 351 ...] [ 98 341  48 161 522 189 515 117 ...]
...
(train/test index arrays for each of the five ShuffleSplit folds; long output truncated)
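Unlike KFold, ShuffleSplit draws an independent random train/test partition for every split, so the same index can appear in several test sets. A tiny demonstration on hypothetical data (not the study's dataset):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_small = np.arange(10).reshape(-1, 1)  # ten toy samples
cv_demo = ShuffleSplit(n_splits=3, test_size=0.3, random_state=101)

for train_idx, test_idx in cv_demo.split(X_small):
    # each split is an independent 70/30 shuffle of the same ten indices
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Every split covers all ten indices exactly once (7 train + 3 test), but the test sets of different splits may overlap.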
print("%0.3f accuracy with a standard deviation of %0.3f" % (scores.mean(), scores.std()))
0.762 accuracy with a standard deviation of 0.025
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
MLA = [model_0,model_1,model_2,model_3,model_4,model_5,model_6,model_7,eclf]
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
for alg in MLA:
    predicted = alg.fit(X_trainValid, y_trainValid).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_trainValid, y_trainValid), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)
    MLA_compare.loc[row_index, 'F1-Score'] = f1_score(y_test, predicted)
    row_index += 1
MLA_compare.sort_values(by = ['Test Accuracy'], ascending = False, inplace = True)
MLA_compare
| MLA used | Train Accuracy | Test Accuracy | Precision | Recall | AUC | F1-Score | |
|---|---|---|---|---|---|---|---|
| 1 | RandomForestClassifier | 0.8500 | 0.8125 | 0.611111 | 0.846154 | 0.823077 | 0.709677 |
| 2 | XGBClassifier | 0.8457 | 0.8021 | 0.600000 | 0.807692 | 0.803846 | 0.688525 |
| 3 | GradientBoostingClassifier | 0.8174 | 0.8021 | 0.606061 | 0.769231 | 0.791758 | 0.677966 |
| 4 | BaggingClassifier | 0.8457 | 0.8021 | 0.594595 | 0.846154 | 0.815934 | 0.698413 |
| 6 | DecisionTreeClassifier | 0.8500 | 0.8021 | 0.600000 | 0.807692 | 0.803846 | 0.688525 |
| 8 | VotingClassifier | 0.8478 | 0.8021 | 0.600000 | 0.807692 | 0.803846 | 0.688525 |
| 7 | BernoulliNB | 0.6304 | 0.7708 | 0.576923 | 0.576923 | 0.709890 | 0.576923 |
| 0 | LogisticRegression | 0.6870 | 0.7604 | 0.548387 | 0.653846 | 0.726923 | 0.596491 |
| 5 | SVC | 0.7717 | 0.6979 | 0.461538 | 0.692308 | 0.696154 | 0.553846 |
# Creating plot to show the ROC for all MLA
index = 1
for alg in MLA:
    predicted = alg.fit(X_trainValid, y_trainValid).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    roc_auc_mla = auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))
    index += 1
plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0,1],[0,1],'r--')
plt.xlim([0,1])
plt.ylim([0,1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
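Note that `roc_curve` on hard 0/1 predictions yields only a single operating point, so the curves above are two line segments per model. Passing class probabilities instead traces the full curve. A self-contained sketch on synthetic data (the notebook's own models and splits are assumed to work the same way via `predict_proba`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# toy binary problem standing in for the accident dataset
X_toy, y_toy = make_classification(n_samples=400, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=101)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # scores for the positive class
fpr, tpr, _ = roc_curve(y_te, proba)   # one point per distinct threshold
print(round(auc(fpr, tpr), 3))
```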
X_trainValid = pd.concat([X_train_imbalance,X_valid],axis=0)
y_trainValid = pd.concat([y_train_imbalance,y_valid],axis=0)
MLA_columns = []
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
for alg in MLA:
    predicted = alg.fit(X_trainValid, y_trainValid).predict(X_test)
    fp, tp, th = roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_trainValid, y_trainValid), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = auc(fp, tp)
    MLA_compare.loc[row_index, 'F1-Score'] = f1_score(y_test, predicted)
    row_index += 1
MLA_compare.sort_values(by = ['Test Accuracy'], ascending = False, inplace = True)
MLA_compare
| MLA used | Train Accuracy | Test Accuracy | Precision | Recall | AUC | F1-Score | |
|---|---|---|---|---|---|---|---|
| 2 | XGBClassifier | 0.8478 | 0.8125 | 0.633333 | 0.730769 | 0.786813 | 0.678571 |
| 3 | GradientBoostingClassifier | 0.8031 | 0.8125 | 0.642857 | 0.692308 | 0.774725 | 0.666667 |
| 8 | VotingClassifier | 0.8504 | 0.8125 | 0.633333 | 0.730769 | 0.786813 | 0.678571 |
| 1 | RandomForestClassifier | 0.8530 | 0.8021 | 0.600000 | 0.807692 | 0.803846 | 0.688525 |
| 6 | DecisionTreeClassifier | 0.8530 | 0.7917 | 0.593750 | 0.730769 | 0.772527 | 0.655172 |
| 4 | BaggingClassifier | 0.8425 | 0.7812 | 0.571429 | 0.769231 | 0.777473 | 0.655738 |
| 7 | BernoulliNB | 0.6772 | 0.7708 | 0.583333 | 0.538462 | 0.697802 | 0.560000 |
| 0 | LogisticRegression | 0.6772 | 0.7500 | 0.555556 | 0.384615 | 0.635165 | 0.454545 |
| 5 | SVC | 0.7533 | 0.7396 | 0.517241 | 0.576923 | 0.688462 | 0.545455 |
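This second comparison retrains on `X_train_imbalance`, a resampled training set built earlier in the notebook. One common way to produce such a balanced set is random oversampling of the minority class; a minimal sketch on hypothetical toy data (the notebook's actual resampling method is not shown here, so this is illustrative only):

```python
import pandas as pd
from sklearn.utils import resample

# toy training frame with an imbalanced binary target (hypothetical data)
train = pd.DataFrame({'feat': range(10),
                      'Fatal or Serious': [0] * 7 + [1] * 3})

majority = train[train['Fatal or Serious'] == 0]
minority = train[train['Fatal or Serious'] == 1]

# sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=101)
balanced = pd.concat([majority, minority_up])
print(balanced['Fatal or Serious'].value_counts().to_dict())
```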
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 477 entries, 0 to 478
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Performance-Based Errors           477 non-null    int64
 1   Judgment & Decision-Making Errors  477 non-null    int64
 2   Violations                         477 non-null    int64
 3   Physical Environment               477 non-null    int64
 4   Inadequate Supervision             477 non-null    int64
 5   Technology Failure                 477 non-null    int64
 6   Acts                               477 non-null    int64
 7   Preconditions                      477 non-null    int64
 8   Supervision                        477 non-null    int64
 9   Organization                       477 non-null    int64
 10  Fatal or Serious                   477 non-null    int64
 11  Flight Segment 1=Taxi              477 non-null    int64
 12  91=1/121=0                         477 non-null    int64
dtypes: int64(13)
memory usage: 68.3 KB
data.drop(['Flight Segment 1=Taxi'],axis=1,inplace=True)
from mlxtend.frequent_patterns import apriori, association_rules
# note: rebinding the name `apriori` shadows the imported function; rename the
# result (e.g. `freq_itemsets`) if this cell needs to be re-run
apriori = apriori(data, min_support=0.2, use_colnames=True, verbose=1)
apriori.sort_values(by='support', ascending=False).head(30)
Processing 9 combinations | Sampling itemset size 3
C:\ProgramData\Anaconda3\lib\site-packages\mlxtend\frequent_patterns\fpcommon.py:110: DeprecationWarning: DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type
  warnings.warn(
| support | itemsets | |
|---|---|---|
| 5 | 0.681342 | (91=1/121=0) |
| 2 | 0.637317 | (Acts) |
| 10 | 0.515723 | (Acts, 91=1/121=0) |
| 0 | 0.477987 | (Performance-Based Errors) |
| 6 | 0.477987 | (Acts, Performance-Based Errors) |
| 7 | 0.387841 | (Performance-Based Errors, 91=1/121=0) |
| 12 | 0.387841 | (Acts, Performance-Based Errors, 91=1/121=0) |
| 4 | 0.356394 | (Fatal or Serious) |
| 1 | 0.320755 | (Technology Failure) |
| 3 | 0.253669 | (Preconditions) |
| 11 | 0.245283 | (Fatal or Serious, 91=1/121=0) |
| 9 | 0.241090 | (Fatal or Serious, Acts) |
| 8 | 0.218029 | (Technology Failure, 91=1/121=0) |
| 13 | 0.205451 | (Fatal or Serious, Acts, 91=1/121=0) |
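The support of an itemset is simply the fraction of records in which every item of the set is present. A toy check on a hypothetical one-hot transaction table (not the study's dataset):

```python
import pandas as pd

# hypothetical one-hot table: 1 = factor present in the accident record
df = pd.DataFrame({
    'Acts':   [1, 1, 1, 0, 1],
    'Errors': [1, 0, 1, 0, 1],
})

# support(X) = share of rows where all items of X equal 1
support_acts = (df['Acts'] == 1).mean()                          # 4/5 = 0.8
support_both = ((df['Acts'] == 1) & (df['Errors'] == 1)).mean()  # 3/5 = 0.6
print(support_acts, support_both)
```

With `min_support = 0.2`, any itemset present in at least 20% of records survives, which is why the table above bottoms out at support ≈ 0.205.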
rules = association_rules(apriori, metric = "support", min_threshold = 0.1)
rules.head(30)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (Acts) | (Performance-Based Errors) | 0.637317 | 0.477987 | 0.477987 | 0.750000 | 1.569079 | 0.173358 | 2.088050 | 1.000000 |
| 1 | (Performance-Based Errors) | (Acts) | 0.477987 | 0.637317 | 0.477987 | 1.000000 | 1.569079 | 0.173358 | inf | 0.694779 |
| 2 | (Performance-Based Errors) | (91=1/121=0) | 0.477987 | 0.681342 | 0.387841 | 0.811404 | 1.190891 | 0.062168 | 1.689630 | 0.307066 |
| 3 | (91=1/121=0) | (Performance-Based Errors) | 0.681342 | 0.477987 | 0.387841 | 0.569231 | 1.190891 | 0.062168 | 1.211815 | 0.503023 |
| 4 | (Technology Failure) | (91=1/121=0) | 0.320755 | 0.681342 | 0.218029 | 0.679739 | 0.997647 | -0.000514 | 0.994994 | -0.003460 |
| 5 | (91=1/121=0) | (Technology Failure) | 0.681342 | 0.320755 | 0.218029 | 0.320000 | 0.997647 | -0.000514 | 0.998890 | -0.007347 |
| 6 | (Fatal or Serious) | (Acts) | 0.356394 | 0.637317 | 0.241090 | 0.676471 | 1.061436 | 0.013954 | 1.121022 | 0.089931 |
| 7 | (Acts) | (Fatal or Serious) | 0.637317 | 0.356394 | 0.241090 | 0.378289 | 1.061436 | 0.013954 | 1.035218 | 0.159588 |
| 8 | (Acts) | (91=1/121=0) | 0.637317 | 0.681342 | 0.515723 | 0.809211 | 1.187672 | 0.081493 | 1.670209 | 0.435688 |
| 9 | (91=1/121=0) | (Acts) | 0.681342 | 0.637317 | 0.515723 | 0.756923 | 1.187672 | 0.081493 | 1.492052 | 0.495881 |
| 10 | (Fatal or Serious) | (91=1/121=0) | 0.356394 | 0.681342 | 0.245283 | 0.688235 | 1.010118 | 0.002457 | 1.022111 | 0.015563 |
| 11 | (91=1/121=0) | (Fatal or Serious) | 0.681342 | 0.356394 | 0.245283 | 0.360000 | 1.010118 | 0.002457 | 1.005634 | 0.031433 |
| 12 | (Acts, Performance-Based Errors) | (91=1/121=0) | 0.477987 | 0.681342 | 0.387841 | 0.811404 | 1.190891 | 0.062168 | 1.689630 | 0.307066 |
| 13 | (Acts, 91=1/121=0) | (Performance-Based Errors) | 0.515723 | 0.477987 | 0.387841 | 0.752033 | 1.573331 | 0.141331 | 2.105165 | 0.752475 |
| 14 | (Performance-Based Errors, 91=1/121=0) | (Acts) | 0.387841 | 0.637317 | 0.387841 | 1.000000 | 1.569079 | 0.140663 | inf | 0.592466 |
| 15 | (Acts) | (Performance-Based Errors, 91=1/121=0) | 0.637317 | 0.387841 | 0.387841 | 0.608553 | 1.569079 | 0.140663 | 1.563836 | 1.000000 |
| 16 | (Performance-Based Errors) | (Acts, 91=1/121=0) | 0.477987 | 0.515723 | 0.387841 | 0.811404 | 1.573331 | 0.141331 | 2.567793 | 0.698079 |
| 17 | (91=1/121=0) | (Acts, Performance-Based Errors) | 0.681342 | 0.477987 | 0.387841 | 0.569231 | 1.190891 | 0.062168 | 1.211815 | 0.503023 |
| 18 | (Fatal or Serious, Acts) | (91=1/121=0) | 0.241090 | 0.681342 | 0.205451 | 0.852174 | 1.250729 | 0.041186 | 2.155630 | 0.264150 |
| 19 | (Fatal or Serious, 91=1/121=0) | (Acts) | 0.245283 | 0.637317 | 0.205451 | 0.837607 | 1.314271 | 0.049128 | 2.233366 | 0.316837 |
| 20 | (Acts, 91=1/121=0) | (Fatal or Serious) | 0.515723 | 0.356394 | 0.205451 | 0.398374 | 1.117791 | 0.021650 | 1.069777 | 0.217599 |
| 21 | (Fatal or Serious) | (Acts, 91=1/121=0) | 0.356394 | 0.515723 | 0.205451 | 0.576471 | 1.117791 | 0.021650 | 1.143431 | 0.163731 |
| 22 | (Acts) | (Fatal or Serious, 91=1/121=0) | 0.637317 | 0.245283 | 0.205451 | 0.322368 | 1.314271 | 0.049128 | 1.113757 | 0.659313 |
| 23 | (91=1/121=0) | (Fatal or Serious, Acts) | 0.681342 | 0.241090 | 0.205451 | 0.301538 | 1.250729 | 0.041186 | 1.086545 | 0.629095 |
rules[rules['lift'] >= 1].sort_values(by='lift',ascending=False).head(30)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | |
|---|---|---|---|---|---|---|---|---|---|---|
| 16 | (Performance-Based Errors) | (Acts, 91=1/121=0) | 0.477987 | 0.515723 | 0.387841 | 0.811404 | 1.573331 | 0.141331 | 2.567793 | 0.698079 |
| 13 | (Acts, 91=1/121=0) | (Performance-Based Errors) | 0.515723 | 0.477987 | 0.387841 | 0.752033 | 1.573331 | 0.141331 | 2.105165 | 0.752475 |
| 1 | (Performance-Based Errors) | (Acts) | 0.477987 | 0.637317 | 0.477987 | 1.000000 | 1.569079 | 0.173358 | inf | 0.694779 |
| 15 | (Acts) | (Performance-Based Errors, 91=1/121=0) | 0.637317 | 0.387841 | 0.387841 | 0.608553 | 1.569079 | 0.140663 | 1.563836 | 1.000000 |
| 14 | (Performance-Based Errors, 91=1/121=0) | (Acts) | 0.387841 | 0.637317 | 0.387841 | 1.000000 | 1.569079 | 0.140663 | inf | 0.592466 |
| 0 | (Acts) | (Performance-Based Errors) | 0.637317 | 0.477987 | 0.477987 | 0.750000 | 1.569079 | 0.173358 | 2.088050 | 1.000000 |
| 19 | (Fatal or Serious, 91=1/121=0) | (Acts) | 0.245283 | 0.637317 | 0.205451 | 0.837607 | 1.314271 | 0.049128 | 2.233366 | 0.316837 |
| 22 | (Acts) | (Fatal or Serious, 91=1/121=0) | 0.637317 | 0.245283 | 0.205451 | 0.322368 | 1.314271 | 0.049128 | 1.113757 | 0.659313 |
| 18 | (Fatal or Serious, Acts) | (91=1/121=0) | 0.241090 | 0.681342 | 0.205451 | 0.852174 | 1.250729 | 0.041186 | 2.155630 | 0.264150 |
| 23 | (91=1/121=0) | (Fatal or Serious, Acts) | 0.681342 | 0.241090 | 0.205451 | 0.301538 | 1.250729 | 0.041186 | 1.086545 | 0.629095 |
| 12 | (Acts, Performance-Based Errors) | (91=1/121=0) | 0.477987 | 0.681342 | 0.387841 | 0.811404 | 1.190891 | 0.062168 | 1.689630 | 0.307066 |
| 3 | (91=1/121=0) | (Performance-Based Errors) | 0.681342 | 0.477987 | 0.387841 | 0.569231 | 1.190891 | 0.062168 | 1.211815 | 0.503023 |
| 2 | (Performance-Based Errors) | (91=1/121=0) | 0.477987 | 0.681342 | 0.387841 | 0.811404 | 1.190891 | 0.062168 | 1.689630 | 0.307066 |
| 17 | (91=1/121=0) | (Acts, Performance-Based Errors) | 0.681342 | 0.477987 | 0.387841 | 0.569231 | 1.190891 | 0.062168 | 1.211815 | 0.503023 |
| 8 | (Acts) | (91=1/121=0) | 0.637317 | 0.681342 | 0.515723 | 0.809211 | 1.187672 | 0.081493 | 1.670209 | 0.435688 |
| 9 | (91=1/121=0) | (Acts) | 0.681342 | 0.637317 | 0.515723 | 0.756923 | 1.187672 | 0.081493 | 1.492052 | 0.495881 |
| 20 | (Acts, 91=1/121=0) | (Fatal or Serious) | 0.515723 | 0.356394 | 0.205451 | 0.398374 | 1.117791 | 0.021650 | 1.069777 | 0.217599 |
| 21 | (Fatal or Serious) | (Acts, 91=1/121=0) | 0.356394 | 0.515723 | 0.205451 | 0.576471 | 1.117791 | 0.021650 | 1.143431 | 0.163731 |
| 7 | (Acts) | (Fatal or Serious) | 0.637317 | 0.356394 | 0.241090 | 0.378289 | 1.061436 | 0.013954 | 1.035218 | 0.159588 |
| 6 | (Fatal or Serious) | (Acts) | 0.356394 | 0.637317 | 0.241090 | 0.676471 | 1.061436 | 0.013954 | 1.121022 | 0.089931 |
| 10 | (Fatal or Serious) | (91=1/121=0) | 0.356394 | 0.681342 | 0.245283 | 0.688235 | 1.010118 | 0.002457 | 1.022111 | 0.015563 |
| 11 | (91=1/121=0) | (Fatal or Serious) | 0.681342 | 0.356394 | 0.245283 | 0.360000 | 1.010118 | 0.002457 | 1.005634 | 0.031433 |
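The rule metrics above follow directly from the supports: confidence(A→B) = support(A∪B) / support(A), lift = confidence / support(B), and leverage = support(A∪B) − support(A)·support(B). Verifying the (Acts) → (Performance-Based Errors) row with the tabulated supports:

```python
support_A = 0.637317   # antecedent support: (Acts)
support_B = 0.477987   # consequent support: (Performance-Based Errors)
support_AB = 0.477987  # joint support of the rule

confidence = support_AB / support_A             # ≈ 0.750000 in the table
lift = confidence / support_B                   # ≈ 1.569079
leverage = support_AB - support_A * support_B   # ≈ 0.173358

print(confidence, lift, leverage)
```

A lift above 1 means the consequent occurs more often with the antecedent than its base rate alone would predict, which is why the table is filtered to `lift >= 1`.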
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
plt.scatter(rules['support'], rules['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
fit = np.polyfit(rules['lift'], rules['confidence'], 1)
fit_fn = np.poly1d(fit)
plt.plot(rules['lift'], rules['confidence'], 'yo', rules['lift'],
fit_fn(rules['lift']))